Abstract: Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and will release all software and evaluation code. We summarize important conclusions here: (1) Pose estimation appears roughly solved for scenes with isolated hands. However, methods still struggle to analyze cluttered scenes where hands may be interacting with nearby objects and surfaces. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.
1 Introduction

Human hand pose estimation empowers many practical
applications, for example sign language recognition [20],
visual interfaces [23], and driver analysis [27]. Recently
introduced consumer depth cameras have spurred a 

Motivation: Recent methods have demonstrated impressive results. But differing (often in-house) testsets, varying performance criteria, and annotation errors impede reliable comparisons [26]. Indeed, a recent meta-level analysis of object tracking papers reveals that it is difficult to trust the “best” reported method in any one paper [29]. In the field of object recognition, comprehensive benchmark evaluation has been vital for progress [T3l[TTll8] . Our goal is to similarly diagnose the state of affairs, and to suggest future strategic directions, for depth-based hand pose estimation.

Contributions: Foremost, we contribute the most extensive evaluation of depth-based hand pose estimators to date. We evaluate 13 state-of-the-art hand-pose estimation systems across 4 testsets under uniform scoring criteria. Additionally, we provide a broad survey of contemporary approaches, introduce a new testset that addresses prior limitations, and propose a new baseline for pose estimation based on nearest-neighbor (NN) exemplar volumes. Surprisingly, we find that NN exceeds the accuracy of most existing systems. We organize our discussion along three axes: test data (Sec. 2), training data (Sec. 3), and model architectures (Sec. 4). We survey and taxonomize approaches for each dimension, and also contribute novelty to each dimension (e.g. new data and models). After explicitly describing our experimental protocol (Sec. 5), we end with an extensive empirical analysis (Sec. 6).

Fig. 1 NN Memorization: We evaluate a broad collection of hand pose estimation algorithms on different training and testsets under consistent evaluation criteria. Test sets which contained limited variety, in pose and range, or which lacked complex backgrounds were notably easier. To aid our analysis, we introduce a simple 3D exemplar (nearest-neighbor) baseline that both detects and estimates pose surprisingly well, outperforming most existing systems. We show the best-matching detection window in (b) and the best-matching exemplar in (c). We use our baseline to rank dataset difficulty, compare algorithms, and illustrate the importance of training set design. We provide a detailed analysis of which problem types are currently solved, what open research challenges remain, and provide suggestions for future model architectures. Columns: ICL, NYU, Ours, UCI-EGO.

Preview: We foreshadow our conclusions here. When hands are easily segmented or detected, current systems perform quite well. However, hand “activities” involving interactions with objects/surfaces are still challenging (motivating the introduction of our new dataset). Moreover, in such cases even humans perform imperfectly. For reasonable error measures, annotators disagree 20% of the time (due to self- and inter-object occlusions and low resolution). This has immediate implications for test benchmarks, but also imposes a challenge when collecting and annotating training data. Finally, our NN baseline illustrates some surprising points. Simple memorization of training data performs quite well, outperforming most existing systems. Variations in the training data often dwarf variations in the model architectures themselves (e.g., decision forests versus deep neural nets). Thus, our analysis offers the salient conclusion that “it’s all about the (training) data”.

Prior work: Our work follows in the rich tradition of
benchmarking pTlfSTllQ] and taxonomic analysis [38l
[To] . In particular, Erol et al. HD] provided a review of
hand pose analysis in 2007. Contemporary approaches
have considerably evolved, prompted by the introduction
of commodity depth cameras. We believe the time


Dataset        Chal.  Scn.  Annot.  Frms.   Sub.  Cam.    Dist. (mm)
ASTAR          A      1     435     435     10    ToF     270-580
Dexter 1 [42]  A      1     3,157   3,157   1     Both    100-989
MSRA [33]      A      1     2,400   2,400   6     ToF     339-422
ICL [45]       A      1     1,599   1,599   1     Struct  200-380
FORTH [28]     AV     1     0       7,148   5     Struct  200-1110
NYU [47]       AV     1     8,252   8,252   2     Struct  510-1070
KTH            AVC    1     0       46,000  9     Struct  NA
UCI-EGO [35]   AVC    4     364     3,640   2     ToF     200-390
Ours           AVC    10+   23,640  23,640  10    Both    200-1950

Challenges (Chal.): A-Articulation V-Viewpoint C-Clutter

Table 1 Testing data sets: We group existing benchmark testsets into 3 groups based on the overall challenges addressed - articulation, viewpoint, and/or background clutter. We also tabulate the number of captured scenes, number of annotated versus total frames, number of subjects, camera type (structured light vs time-of-flight), and distance of the hand to camera. We introduce a new dataset (Ours) that contains a significantly larger range of hand depths (up to 2m), more scenes (10+), more annotated frames (24K), and more subjects (10) than prior work.


is right for another look. We do extensive cross-dataset analysis (by training and testing systems on different datasets [48]). Human-level studies in benchmark evaluation [22] inspired our analysis of human performance. Finally, our NN baseline is closely inspired by non-parametric approaches to pose estimation [39]. In particular, we make use of volumetric depth features in a 3D scanning-window (or volume) framework, similar to [41]. However, our baseline does not require SVM training or multi-cue features, making it considerably simpler to implement.


2 Testing Data 

Test scenarios for depth-based hand-pose estimation have evolved rapidly. Early work evaluated on synthetic data, while contemporary work almost exclusively evaluates on real data. However, because of difficulties in manual annotation (a point that we will revisit), evaluation was not always quantitative - instead, it has been common to show select frames to give a qualitative sense of performance. However, we fundamentally assume that quantitative evaluation on real data will be vital for continued progress.

Test set properties: We have tabulated a list of contemporary test benchmarks in Table 1, giving URLs on our website (http://www.ics.uci.edu/~jsupanci/#HandData). We refer the reader to the caption for a detailed summary of specific dataset properties. For each dataset, Fig. 2 visualizes the pose-space covered using multi-dimensional scaling (MDS). We plot both joint positions (in a normalized coordinate frame that is centered and scaled) and joint angles. Importantly, the position plot takes the global orientation (or camera viewpoint) of the hand into account while the angle plot does not. Most datasets are diverse in terms of joint angles but many are limited in terms of positions (implying they are limited in viewpoint). Indeed, we found that previous datasets make various assumptions about articulation, viewpoint, and perhaps most importantly, background clutter. Such assumptions are useful because they allow researchers to focus on particular aspects of the problem. However, it is crucial to make such assumptions explicit [48], which much prior work does not. We do so below.
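As a rough illustration of how such a pose-space plot can be produced, the sketch below embeds pose vectors into 2D with classical (Torgerson) MDS. The text does not say which MDS variant was used, so the classical form, the function name `classical_mds`, and the row-vector pose layout are all assumptions made for illustration.

```python
import numpy as np

def classical_mds(poses, dim=2):
    """Embed row-vector poses into `dim` dimensions with classical MDS."""
    poses = np.asarray(poses, dtype=float)
    # Squared Euclidean distances between all pairs of poses.
    sq = np.sum(poses ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * poses @ poses.T
    n = len(poses)
    # Double-center the squared distances to obtain a Gram matrix.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ d2 @ j
    # Keep the eigenvectors with the largest eigenvalues.
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:dim]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

Each row of `poses` could hold either flattened, normalized joint positions or joint angles; the convex hull of the returned 2D points would then give one outline per testset.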


Articulation: Many datasets focus on pose estimation with the assumption that detection and overall hand viewpoint are either given or limited in variation. Example datasets include MSRA [33], A-Star, and Dexter [42]. We focus on ICL [45] as a representative example for experimental evaluation because it has been used in multiple prior published works [45ll6].


Art. and viewpoint: Other testsets have focused on both viewpoint variation and articulation. FORTH [28] provides five test sequences with varied articulations and viewpoints, but these are unfortunately unannotated. In our experiments, we analyze the NYU dataset [47] because of its wider pose variation (see Fig. 2) and accurate annotations (see Sec. 3).


Art. + View. + Clutter: The most difficult datasets contain cluttered backgrounds that are not easy to segment away. These datasets tend to focus on “in-the-wild” hands undergoing activities and interacting with nearby objects and surfaces. The KTH Dataset provides a rich set of 3rd person videos showing humans interacting with objects. Unfortunately, annotations are not provided for the hands (only the objects). The UCI-EGO [35] dataset provides challenging sequences from an egocentric perspective, and so is included in our benchmark analysis.


Our testset: Our empirical evaluation will show that in-the-wild hand activity is still challenging. To push research in this direction, we have collected and annotated our own testset of real images (labeled as Ours in Table 1). As far as we are aware, our dataset is the first to focus on hand pose estimation across multiple subjects and multiple cluttered scenes. This is important, because any practical application must handle diverse subjects, scenes, and clutter.


Fig. 2 Pose variation: We use MDS (multi-dimensional scaling) to plot the pose space covered by various hand datasets. For each testset, we plot the convex hull of its poses. We plot joint positions (left) and joint angles (right). In terms of joint angle coverage (which does not consider the “root” orientation of the hand itself), most datasets are similar. In terms of joint position, some datasets are limited because they consider a smaller range of viewpoints (e.g., ICL and A-STAR). We further analyze various assumptions made by datasets in the text. Panels: (a) Position, (b) Angle. Datasets plotted: ICL, A-STAR, NYU, FORTH, UCI-EGO, Ours.



3 Training Data 

Here we discuss various approaches for generating training data. Real annotated training data has long been the gold standard for supervised learning. However, the generally accepted wisdom (for hand pose estimation) is that the space of poses is too large to manually annotate. This motivates approaches to leverage synthetically generated training data, discussed further below.


Real data + manual annotation: Arguably, the space of hand poses exceeds what can be sampled with real data. Our experiments identify a second problem: perhaps surprisingly, human annotators often disagree on pose annotations. For example, in our testset, human annotators visually disagreed on 20% of pose annotations (given a visually-acceptable threshold of 20mm) as plotted in Fig. 14. These disagreements arise from limitations in the raw sensor data, either due to poor resolution or occlusions (as shown in Sec. 5.2). These ambiguities are often mitigated by placing the hand close to the camera [45, 33, 51]. As an illustrative example, we evaluate the ICL training set [45].


Real data + automatic annotation: Data gloves directly obtain automatic pose annotations for real data. However, they require painstaking per-user calibration and distort the hand shape that is observed in the depth map. Alternatively, one could use a “passive” motion capture system. We evaluate the NYU training set [47] that annotates real data by fitting (offline) a skinned 3D hand model to high-quality 3D measurements.


Quasi-synthetic data: Augmenting real data with geometric computer graphics models provides an attractive solution. For example, one can apply geometric transformations (e.g., rotations) to both real data and its annotations [45]. If multiple depth cameras are used to collect real data (that is then registered to a model), one can synthesize a larger set of varied viewpoints. Finally, mimicking the noise and artifacts of real data is often important when using synthetic data. Domain transfer methods [6] learn the relationships between a small real dataset and a large synthetic one.

Fig. 3 libhand joints: We use the above joint identifiers to describe how we sample poses (for libhand) in Table 2. Please see http://www.libhand.org/ for more details on the joints and their parameters.

Synthetic data: Another hope is to use data rendered by a computer graphics system. Graphical synthesis sidesteps the annotation problem completely: precise annotations can be rendered along with the features. When synthesizing novel exemplars, it is important to define a good sampling distribution. The UCI-EGO training set [35] synthesizes data with an egocentric prior over viewpoints and grasping poses. A common strategy for generating a sampling distribution is to collect pose samples with motion capture data.

3.1 libhand training set: 

To further examine the effect of training data, we created a massive custom training set of 25,000,000 RGB-D training instances with the open-source libhand model. We modified the code to include a forearm and output depth data, semantic segmentations, and keypoint annotations. We emphasize that this synthetic training set is distinct from our new test dataset of real images.

Synthesis parameters: To avoid biasing our synthetic 
training set away from unlikely, but possible, poses we 
do not use motion capture data. Instead, we take a 
brute-force approach based on rejection-sampling. We 


Dataset        Generation            Viewpoint   Views  Size        Subj.
ICL [45]       Real + manual annot.  3rd Pers.   1      331,000     10
NYU [47]       Real + auto annot.    3rd Pers.   3      72,757      1
UCI-EGO [35]   Synthetic             Egocentric  1      10,000      1
libhand [50]   Synthetic             Generic     1      25,000,000  1

Table 3 Training data sets: We broadly categorize training datasets by the method used to generate the data and annotations: real data + manual annotations, real data + automatic annotations, or synthetic data (and automatic annotations). Most existing datasets are viewpoint-specific (tuned for 3rd-person or egocentric recognition) and limited in size to tens of thousands of examples. NYU is unique in that it is a multiview dataset collected with multiple cameras, while ICL contains shape variation due to multiple (10) subjects. To explore the effect of training data, we use the public libhand animation package to generate a massive training set of 25 million examples.

uniformly and independently sample joint angles (from a bounded range), and throw away invalid samples that yield self-intersecting 3D hand poses. Specifically, using the libhand joint identifiers shown in Fig. 3, we generate poses by uniformly sampling from bounded ranges, as shown in Table 2.
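The rejection-sampling idea can be sketched in a few lines. This is a minimal illustration only: it does not call libhand, and the names `sample_valid_poses` and `toy_is_valid`, the `max_tries` cap, and the toy validity rule (a stand-in for a real mesh self-intersection test) are all hypothetical.

```python
import numpy as np

def sample_valid_poses(lower, upper, is_valid, count, rng, max_tries=100000):
    """Rejection-sample `count` joint-angle vectors from per-joint uniform ranges."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    accepted = []
    for _ in range(max_tries):
        # Each joint angle is drawn uniformly and independently.
        pose = rng.uniform(lower, upper)
        # Discard poses the validity check rejects (e.g. self-intersection).
        if is_valid(pose):
            accepted.append(pose)
            if len(accepted) == count:
                break
    return np.array(accepted)

def toy_is_valid(pose):
    """Hypothetical stand-in for a self-intersection test: cap the total bend."""
    return float(np.sum(pose)) < 3.0
```

In practice the validity predicate would pose the hand model and test the resulting geometry for self-intersection; everything else in the loop stays the same.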

Quasi-Synthetic backgrounds: An under-emphasized aspect of synthetic training data is the choice of synthetic backgrounds. For methods operating on pre-segmented images this is likely not an issue. However, for active hands “in-the-wild”, the choice of synthetic backgrounds, surfaces, and interacting objects is likely important. Moreover, some systems require an explicit negative set (of images not containing hands) for training. To create such a background/negative training set, we take a quasi-synthetic approach by applying random affine transformations to 5,000 images of real scenes, yielding a total of 1,000,0000 pseudo-synthetic backgrounds. We found it useful to include human bodies in the negative set because faces are common distractors for hand models.

4 Methods 

Next we survey existing approaches to hand pose estimation (summarized in Table 4). We conclude by introducing a simple volumetric nearest-neighbor (NN) baseline.


4.1 Taxonomy 

Trackers versus detectors: We focus our analysis on
single-frame methods. For completeness, we also consider
several tracking baselines [28, 32, 18] needing
ground-truth initialization. Manual initialization may
provide an unfair advantage, but we will show that
Description

Identifiers 

bend 

side 

elongation 

Intermediate and Distal Joints 
Proximal-Carpal Joints 
Thumb Metacarpal 

Thumb Proximal 

Wrist Articulation 

Fi:4,2:3 

Fi:4,4 

F5,4 

F5,3 

Pi 


0 

.S’-) 

0 

0 

H(.8’^,1.2’^) 

0 

0 


Table 2 Synthetic hand distribution: We render synthetic hands with joint angles sampled from the above uniform
distributions. Bend refers to the natural extension-retraction of the finger joints. The proximal-carpal, wrist and thumb joints
are additionally capable of side-to-side articulation. We do not consider a third type of articulation, twist, because it would
be extremely painful and result in injury. We model anatomical differences by elongating some bones fanning out from a
joint. Additionally, we apply an isotropic global metric scale factor sampled from the range t^(|, §)• Finally, we randomize 
the camera viewpoint by uniformly sampling tilt, yaw and roll from U(0, 2π).


Method                               Approach                       Model-drv.  Data-drv.  Detection          Implementation  FPS
Simulate [23]                        Tracker (simulation)           Yes         No         Initialization     Published       50
NiTE2 [32]                           Tracker (pose search)          No          Yes        Initialization     Public          > 60
Particle Swarm Opt. (PSO) [28]       Tracker (PSO)                  Yes         No         Initialization     Public          15
Hough Forest [51]                    Decision forest                Yes         Yes        Decision forest    Ours            12
Random Decision Forest (RDF) [20]    Decision forest                No          Yes        -                  Ours            8
Latent Regression Forest (LRF) [45]  Decision forest                No          Yes        -                  Published       62
DeepJoint [47]                       Deep network                   Yes         Yes        Decision forest    Published       25
DeepPrior [26]                       Deep network                   No          Yes        Scanning window    Ours            5000
DeepSegment [12]                     Deep network                   No          Yes        Scanning window    Ours            5
Intel PXC [18]                       Morphology (convex detection)  No          No         Heuristic segment  Public          > 60
Cascades [35]                        Hierarchical cascades          No          Yes        Scanning window    Provided        30
EPM [53]                             Deformable part model          No          Yes        Scanning window    Ours            1/2
Volumetric Exemplars                 Nearest neighbor (NN)          No          Yes        Scanning volume    Ours            1/15

Table 4 Summary of methods: We broadly categorize the pose estimation systems that we evaluate by their overall approach: decision forests, deep models, trackers, or others. Though we focus on single-frame systems, we also evaluate trackers by providing them manual initialization. Model-driven methods make use of articulated geometric models at test time, while data-driven models are trained beforehand on a training set. Many systems begin by detecting hands with a Hough-transform or a scanning window/volume search. Finally, we made use of public source code when available, or re-implemented the system ourselves, verifying our implementation’s accuracy on published benchmarks. ‘Published’ indicates that published performance results were used for evaluation, while ‘public’ indicates that source code was available, allowing us to evaluate the method on additional testsets. We report the fastest speeds (in FPS), either reported or our implementation’s.


single-frame methods are nonetheless competitive and, in most cases, outperform tracking-based approaches. One reason is that single-frame methods essentially “reinitialize” themselves at each frame, while trackers cannot recover from an error.

Data-driven versus model-driven: Historic attempts to estimate hand pose optimized a geometric model to fit observed data [7 | [ H [^ . Recently, Oikonomidis et al. [28] achieved success using GPU accelerated Particle Swarm Optimization. However, such optimizations remain notoriously difficult due to local minima in the objective function. As a result, model-driven systems have found their successes mostly in the tracking domain, where initialization constrains the search space [42l[^[33] . For single image detection, various fast classifiers [20, 18] have obtained real-time speeds. Most of the systems we evaluate fall into this category. When these classifiers are trained with data synthesized from a geometric model, they can be seen as efficiently approximating model fitting.

Multi-stage pipelines: It is common to treat the initial detection (candidate generation) stage as separate from hand-pose estimation. Some systems use special purpose detectors as a “pre-processing” stage [TTlfTSlI^ [| 16 [ [51136] . Others use a geometric model for inverse-kinematic (IK) refinement/validation during a “post-processing” stage [51, 47, 23, 42]. A segmentation pre-processing stage has been historically popular. Typically, the depth image is segmented with simple morphological operations or the RGB image is segmented with skin classifiers, allowing features such as Zernike moments [5] or skeletonizations to be computed. The latter appears difficult to generalize across subjects and scenes with varying lighting [33]. We evaluate a depth-based segmentation system [18] for completeness.

4.2 Architectures 

In this section, we describe popular architectures for 
hand-pose estimation, placing in bold those systems 
that we empirically evaluate. 

Decision forests: Decision forests constitute a dominant paradigm for estimating hand pose from depth. Hough Forests [51] take a two-stage approach of hand detection followed by pose estimation. Random Decision Forests (RDFs) [20] and Latent Regression Forests (LRFs) [45] leave the initial detection stage unspecified, but both make use of coarse-to-fine decision trees that perform rough viewpoint classification followed by detailed pose estimation. We experimented with several detection front-ends for RDFs and LRFs, finally selecting the first-stage detector from Hough Forests for its strong performance.

Part model: Pictorial structure models have been popular in human body pose estimation [52], but they appear rare in hand pose estimation. For completeness, we evaluate a deformable part model defined on depth image patches. We specifically train an exemplar part model (EPM) constrained to model deformations consistent with 3D exemplars [53], which will be described further in a tech report.

Deep models: Recent systems have explored the use of deep neural nets for hand pose estimation. We consider three variants in our experiments. DeepJoint [47] uses a three-stage pipeline that initially detects hands with a decision forest, regresses joint locations with a deep network, and finally refines joint predictions with inverse kinematics (IK). DeepPrior [26] is based on a similar deep network, but does not require an IK stage and instead relies on the network itself to learn a spatial prior. DeepSeg [12] takes a pixel-labeling approach, predicting joint labels for each pixel, followed by a clustering stage to produce joint locations. This procedure is reminiscent of the pixel-level part classification of Kinect, but substitutes a deep network for a decision forest.

4.3 Volumetric exemplars 

We propose a nearest-neighbor (NN) baseline for additional diagnostic analysis. Specifically, we convert depth map measurements into a 3D voxel grid, and simultaneously detect and estimate pose by scanning over this grid with volumetric exemplar templates.

Voxel grid: Depth cameras report depth as a function of pixel $(u, v)$ coordinates: $D(u, v)$. To construct a voxel grid, we first re-project these image measurements into 3D using known camera intrinsics $f_u$, $f_v$:

\[ (x, y, z) = \left( \frac{u}{f_u} D(u, v), \; \frac{v}{f_v} D(u, v), \; D(u, v) \right) \tag{1} \]

Given a test depth image, we construct a binary voxel grid $V[x, y, z]$ that is ‘1’ if a depth value is observed at a quantized $(x, y, z)$ location. To cover the rough viewable region of a camera, we define a coordinate frame of $M^3$ voxels, where $M = 200$ and each voxel spans 10mm$^3$. We similarly convert training examples into volumetric exemplars $E[x, y, z]$, but instead use a smaller grid of $N^3$ voxels (where $N = 30$), consistent with the size of a hand.

Occlusions: When a depth measurement is observed at a position $(x', y', z')$, all voxels behind it ($z > z'$) are occluded. We define occluded voxels to be ‘1’ for both the test-time volume $V$ and training exemplar $E$.
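The voxel-grid construction and the occlusion rule can be sketched together as follows. This is an illustrative reading of Eq. (1), not the released implementation: the function name `voxelize`, the principal-point offsets `cu` and `cv`, the grid `origin`, and the convention that zero depth means "no measurement" are assumptions.

```python
import numpy as np

def voxelize(depth, fu, fv, cu, cv, origin, voxel_mm=10.0, size=200):
    """Binary occupancy grid from a depth map (mm), with occluded voxels set to 1."""
    v_idx, u_idx = np.nonzero(depth > 0)          # pixels with a valid measurement
    d = depth[v_idx, u_idx].astype(float)
    # Eq. (1), with (u, v) measured from the assumed principal point (cu, cv).
    pts = np.stack([(u_idx - cu) / fu * d, (v_idx - cv) / fv * d, d], axis=1)
    # Quantize metric coordinates into voxel indices relative to the grid origin.
    idx = np.floor((pts - np.asarray(origin, dtype=float)) / voxel_mm).astype(int)
    inside = np.all((idx >= 0) & (idx < size), axis=1)
    grid = np.zeros((size, size, size), dtype=np.uint8)
    grid[idx[inside, 0], idx[inside, 1], idx[inside, 2]] = 1
    # Occlusion rule: once a voxel is hit, everything behind it along z is 1.
    return np.maximum.accumulate(grid, axis=2)
```

The same routine with `size=30` and an origin centered on the hand would produce an exemplar volume.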

Distance measure: Let $V_j$ be the $j$-th subvolume (of size $N^3$) extracted from $V$, and let $E_i$ be the $i$-th exemplar. We simultaneously detect and estimate pose by computing the best match in terms of Hamming distance:

\[ (i^*, j^*) = \arg\min_{i,j} \mathrm{Dist}(E_i, V_j) \quad \text{where} \tag{2} \]

\[ \mathrm{Dist}(E_i, V_j) = \sum_{x,y,z} I\left(E_i[x,y,z] \neq V_j[x,y,z]\right), \tag{3} \]

such that $E_{i^*}$ is the best-matching training exemplar and $V_{j^*}$ is its detected position.
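A naive version of this search, written directly from Eqs. (2) and (3), might look like the sketch below. The function name `best_match` and the array layout (exemplars stacked as a `(K, N, N, N)` array) are assumptions for illustration; no pruning or speedup is applied.

```python
import numpy as np

def best_match(volume, exemplars):
    """Return (i*, j*, distance): best exemplar index and its subvolume corner."""
    n = exemplars.shape[1]
    best = (None, None, np.inf)
    # Slide an n x n x n window over every position in the test volume.
    for x in range(volume.shape[0] - n + 1):
        for y in range(volume.shape[1] - n + 1):
            for z in range(volume.shape[2] - n + 1):
                sub = volume[x:x + n, y:y + n, z:z + n]
                # Eq. (3): count voxels where each exemplar and the subvolume differ.
                dists = np.sum(exemplars != sub, axis=(1, 2, 3))
                i = int(np.argmin(dists))
                if dists[i] < best[2]:
                    best = (i, (x, y, z), int(dists[i]))
    return best
```

The pose estimate is then the annotation stored with exemplar `i*`, placed at the detected corner `j*`.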

Efficient search: A naive search over exemplars and subvolumes is prohibitively slow. But because the underlying features are binary and sparse, there exist considerable opportunities for speedup. We outline two simple strategies. First, one can eliminate subvolumes that are empty, fully occluded, or out of the camera’s field-of-view. Song et al. [41] refer to such pruning strategies as “jumping window” searches. Second, one can compute volumetric Hamming distances with 2D computations:

\[ \mathrm{Dist}(E_i, V_j) = \sum_{x,y} \left| e_i[x,y] - v_j[x,y] \right| \quad \text{where} \tag{4} \]

\[ e_i[x,y] = \sum_z E_i[x,y,z], \qquad v_j[x,y] = \sum_z V_j[x,y,z]. \]

Intuitively, because our 3D volumes are projections of 2.5D measurements, they can be sparsely encoded with a 2D array (see Fig. 4). Taken together, our two simple strategies imply that a 3D volumetric search can be made as practically efficient as a 2D scanning-window search. For a modest number of exemplars, our implementation still took tens of seconds per frame, which sufficed for our offline analysis. We posit faster NN algorithms could produce real-time performance.
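The equivalence between the 3D Hamming distance and the 2D computation holds because, under the occlusion rule, every (x, y) column is a run of zeros followed by a run of ones, so two columns differ in exactly as many voxels as their occupancy counts differ. The sketch below checks this numerically; the helper names and the random test volumes are illustrative assumptions.

```python
import numpy as np

def hamming_3d(e, v):
    """Eq. (3): number of voxels where the two binary volumes differ."""
    return int(np.sum(e != v))

def hamming_2d(e, v):
    """Eq. (4): L1 distance between per-column occupancy counts along z."""
    e2 = e.sum(axis=2, dtype=np.int64)
    v2 = v.sum(axis=2, dtype=np.int64)
    return int(np.sum(np.abs(e2 - v2)))

def random_occluded_volume(n, rng):
    """Random volume obeying the occlusion rule: each (x, y) column is 0...0 1...1."""
    first_hit = rng.integers(0, n + 1, size=(n, n))   # n means the column is empty
    return (np.arange(n)[None, None, :] >= first_hit[:, :, None]).astype(np.uint8)
```

Note that the identity relies on the occlusion fill; for arbitrary binary volumes the 2D quantity is only a lower bound on the 3D Hamming distance.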
Fig. 4 Volumetric Hamming distance: We visualize 3D voxels corresponding to an exemplar (a) and subvolume (b). For simplicity, we visualize a 2D slice along a fixed y-value. Because occluded voxels are defined to be ‘1’ (indicating they are occupied, shown in blue) the total Hamming distance is readily computed by the L1 distance between projections along the z-axis (c), mathematically shown in Eq. 4.

Fig. 5 Windows v. volumes: 2D scanning windows (a) versus 3D scanning volumes (b). Volumes can ignore background clutter that lies outside the 3D scanning volume but still falls inside its 2D projection. For example, when scoring the above hand, a 3D scanning volume will ignore depth measurements from the shoulder and head, unlike a 2D scanning window.




Comparison: Our volumetric exemplar baseline uses a scanning volume search and 2D depth encodings. It is useful to contrast this with a “standard” 2D scanning-window template on depth features. First, our exemplars are defined in metric coordinates (Eq. 1). This means that they will not fire on the small hands of a toy figurine, unlike a scanning window search over scales. Second, our volumetric search ensures that the depth encoding from a local window contains features only within a fixed volume. This gives it the ability to segment out background clutter, unlike a 2D window (Fig. 5).


5 Protocols 

5.1 Evaluation 

Reprojection error: Following past work, we evaluate pose estimation as a regression task that predicts a set of 3D joint locations |45l[^[33[l46[l2Qj . Given a predicted and ground-truth pose, we compute both the average and max 3D reprojection error (in mm) across all joints. We use the skeletal joints defined by libhand [50]. We then summarize performance by plotting the proportion of test frames whose average (or max) error falls below a threshold.
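A minimal sketch of these two steps is given below, assuming each pose is a `(J, 3)` array of joint positions in mm. The function names `joint_errors` and `fraction_below` are illustrative, and the strict inequality is one reading of "falls below a threshold".

```python
import numpy as np

def joint_errors(pred, truth):
    """Mean and max Euclidean distance (mm) over corresponding 3D joints."""
    dists = np.linalg.norm(np.asarray(pred, float) - np.asarray(truth, float), axis=1)
    return float(dists.mean()), float(dists.max())

def fraction_below(errors, thresholds):
    """Proportion of frames whose error is strictly below each threshold."""
    errors = np.asarray(errors, dtype=float)
    return [float(np.mean(errors < t)) for t in thresholds]
```

Frames scored as infinite error (detection failures) never fall below any finite threshold, so they lower the curve at every point.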

Error thresholds: Much past work considers performance
at fairly low error thresholds, approaching
10mm. Interestingly, [26] show that established
benchmarks such as the ICL testset include annotation
errors of above 10mm in over a third of their

Fig. 6 Our error criteria: For each predicted hand, we calculate the average and maximum distance (in mm) between its skeletal joints and a ground-truth. In our experimental results, we plot the fraction of predictions that lie within a distance threshold, for various thresholds. This figure visually illustrates the misalignment associated with various thresholds for max error. A 50mm max-error seems visually consistent with a “roughly correct pose estimation”, and a 100mm max-error is consistent with a “correct hand detection”. Panels show max errors of 10mm, 50mm, 100mm, and 200mm.


frames. Ambiguities arise from manual labeling of joints versus bones and centroids versus surface points. We rigorously evaluate human-level performance through inter-annotator agreement on our new testset (Fig. 14). Overall, we find that max-errors of 20mm approach the limit of human accuracy for closeby hands. We present a qualitative visualization of max error at different thresholds in Fig. 6. 50mm appears consistent with a roughly correct pose, while an error within 100mm appears consistent with a correct detection. Our qualitative analysis is consistent with empirical studies of human grasp [2] and gesture, which also suggest that 50mm is sufficient to capture differences in gesture or grasp. For completeness, we plot results across a large range of thresholds, but highlight 50 and 100mm thresholds for additional analysis.


Detection issues: Reprojection error is hard to define during detection failures: that is, false positive hand detections or missed hand detections. Such failures are likely in cluttered scenes or when considering scenes containing zero or two hands. If a method produced zero detections when a hand was present, or produced one if no hand was present, this was treated as a “maxed-out” reprojection error (of ∞ mm). If two hands were present, we scored each method against both and took the minimum error. Though we plan to release our evaluation software, we give pseudocode in Alg. 1.

Missing data: Another challenge with reprojection error is missing data. First, some methods predict 2D rather than 3D joints [T8l[3T[l47l[T^ . Inferring depth should in theory be straightforward with Eq. 1, but small 2D errors in the estimated joint can cause significant errors in the estimated depth. We report back the centroid depth of a segmented/detected hand if the measured depth lies outside the segmented volume.
input : predictions and ground truths for each image 
output: a set of errors, one per frame 
forall the test-images do 
    P ← method’s most confident prediction; 
    G ← ground truths for the current test-image; 
    if G = ∅ then 
        /* Test image contains zero hands */ 
        if P = ∅ then 
            errors ← errors ∪ {0}; 
        else 
            errors ← errors ∪ {∞}; 
        end 
    else 
        /* Test image contains hand(s) */ 
        if P = ∅ then 
            errors ← errors ∪ {∞}; 
        else 
            best_error ← ∞; 
            /* Find the ground truth best matching the method’s prediction */ 
            forall the H ∈ G do 
                /* For mean error plots, replace max_i with mean_i */ 
                /* V denotes the set of visible joints */ 
                current_error ← max_{i∈V} ||H_i − P_i||_2; 
                if current_error < best_error then 
                    best_error ← current_error; 
                end 
            end 
            errors ← errors ∪ {best_error}; 
        end 
    end 
end 

Algorithm 1: Scoring Procedure: For each frame 
we compute a max or mean re-projection error for 
the ground truth(s) G and prediction(s) P. We later 
plot the proportion of frames with an error below a 
threshold, for various thresholds. 
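
For readers who prefer executable code, the scoring procedure of Algorithm 1 can be sketched as follows. This is an illustrative reading of the pseudocode, not the released evaluation software: the names `joint_error` and `score_frame`, the use of `None` for "no detection", and the data layout (each ground-truth hand as a pair of joint list and visible indices) are assumptions.

```python
import math
from statistics import fmean

def joint_error(gt_joints, pred_joints, visible, reduce=max):
    """Reduce per-joint Euclidean distances over the visible joint indices."""
    return reduce([math.dist(gt_joints[i], pred_joints[i]) for i in visible])

def score_frame(prediction, ground_truths, reduce=max):
    """Score one frame.

    prediction: list of joint coordinates, or None if the method reports no hand.
    ground_truths: list of (joints, visible_indices) pairs, one per annotated hand.
    reduce: max for max-error plots, fmean for mean-error plots.
    """
    if not ground_truths:
        # Zero hands present: correct only if nothing was detected.
        return 0.0 if prediction is None else math.inf
    if prediction is None:
        # Hand(s) present but missed.
        return math.inf
    # Keep the ground-truth hand that best matches the prediction.
    return min(joint_error(gt, prediction, vis, reduce) for gt, vis in ground_truths)

# Hypothetical two-joint hands, coordinates in mm.
hand_a = ([(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)], [0, 1])
hand_b = ([(100.0, 0.0, 0.0), (110.0, 0.0, 0.0)], [0, 1])
pred = [(0.0, 3.0, 4.0), (10.0, 0.0, 0.0)]

max_err = score_frame(pred, [hand_a, hand_b])          # matches hand_a
mean_err = score_frame(pred, [hand_a, hand_b], fmean)  # mean over visible joints
```

The sketch assumes every ground-truth hand has at least one visible joint; an empty visible set would need separate handling.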

Past comparisons appear not to do this [26], somewhat 
unfairly penalizing 2D approaches [?]. Second, some 
methods may predict only a subset of joints [18][31]. To 
ensure a consistent comparison, we force such methods 
to predict the locations of visible joints with a 
post-processing inverse-kinematics (IK) stage [47]. We fit the 
libhand kinematic model to the predicted joints, and 
infer the locations of the missing ones. Third, ground-truth 
joints may be occluded. By convention, we only 
evaluate visible joints in our benchmark analysis. 
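
The depth fallback for 2D-only methods can be illustrated with a small sketch. The function names, the representation of the segmented volume as a depth interval `[z_near, z_far]`, and the pinhole back-projection with intrinsics `fx, fy, cx, cy` are assumptions made for this example; the text above does not specify them.

```python
def joint_depth(depth_map, u, v, z_near, z_far, centroid_z):
    """Depth for a 2D joint at pixel (u, v).

    Uses the measured depth when it falls inside the segmented hand volume
    [z_near, z_far]; otherwise falls back to the hand's centroid depth.
    """
    z = depth_map[v][u]  # row-major: v indexes rows, u indexes columns
    return z if z_near <= z <= z_far else centroid_z

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a pixel and depth to 3D under an assumed pinhole camera model."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

# Hypothetical 2x3 depth map in mm; the right column is background.
depth = [[500.0, 510.0, 2000.0],
         [505.0, 495.0, 1990.0]]

on_hand = joint_depth(depth, 1, 0, 450.0, 600.0, 505.0)   # measured depth kept
off_hand = joint_depth(depth, 2, 0, 450.0, 600.0, 505.0)  # falls back to centroid
```

The second call shows the failure mode described above: a joint that lands one pixel off the hand would otherwise inherit a background depth roughly 1.5m too far.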

Implementations: We use public code when available 
[28][32][18]. Some authors responded to our request for 
their code [35]. When software was not available, we 
attempted to re-implement methods ourselves. We were 
able to successfully reimplement [?][51][20], matching 
the accuracy on published results [45][?]. In other 
cases, our in-house implementations did not suffice 
[?][45]. For these latter cases, we include published 
performance reports, but unfortunately, they are limited 
to their own datasets. This partly motivated us to 
perform a multi-dataset analysis. In particular, previous 
benchmarks have shown that one can still compare 
algorithms across datasets using head-to-head matchups 
(similar to approaches used to rank sports teams that 
do not directly compete [29]). We use our NN baseline 
to do precisely this. Finally, to spur further progress, 
we will make all implementations publicly available, 
together with our evaluation code. 


5.2 Annotation 

We now describe how we collect ground truth 
annotations. We present the annotator with cropped RGB 
and Depth images. They then click semantic key-points, 
corresponding to specific joints, on either the RGB or 
Depth images. To ease the annotator’s task and to 
get 3D keypoints from 2D clicks, we invert the 
forward rendering (graphics) hand model provided by 
libhand, which projects model parameters $\theta$ to 2D 
keypoints $P(\theta)$. While they label joints, an inverse 
kinematic solver minimizes the distance between the 
currently annotated 2D joint labels, $\forall_{j \in J} L_j$, and those 
projected from the libhand model parameters, $\forall_{j \in J} P_j(\theta)$: 

$$\min_{\theta} \sum_{j \in J} \| L_j - P_j(\theta) \|_2 \qquad (5)$$

The currently fitted libhand model, shown to the 
annotator, updates online as more joints are labeled. 
When the annotator indicates satisfaction with the 
fitted model, we proceed to the next frame. We give an 
example of the annotation process in Fig. 7. 
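
To make the fitting objective of Eq. (5) concrete, here is a toy sketch. It does not use libhand: the two-finger planar model, the constants `BASES` and `RADII`, and the derivative-free coordinate search are all stand-ins chosen so the example is self-contained, and a real IK solver would differ.

```python
import math

# Hypothetical toy model: two planar "fingers", each a rigid segment rotating
# about a fixed base point. theta[k] is the flexion angle of finger k.
BASES = [(0.0, 0.0), (30.0, 0.0)]
RADII = {"mid": 25.0, "tip": 50.0}

def project(theta):
    """P(theta): map model parameters to named 2D keypoints."""
    points = {}
    for k, (bx, by) in enumerate(BASES):
        for name, r in RADII.items():
            points[(k, name)] = (bx + r * math.cos(theta[k]),
                                 by + r * math.sin(theta[k]))
    return points

def cost(theta, labels):
    """Eq. (5) objective: summed Euclidean distance over the labeled joints J."""
    projected = project(theta)
    return sum(math.dist(labels[j], projected[j]) for j in labels)

def fit(labels, theta0, step=0.5, tol=1e-6):
    """Coordinate search: try +/- step per parameter, halve step when stuck."""
    theta = list(theta0)
    best = cost(theta, labels)
    while step >= tol:
        improved = False
        for k in range(len(theta)):
            for delta in (step, -step):
                trial = theta[:]
                trial[k] += delta
                c = cost(trial, labels)
                if c < best:
                    theta, best, improved = trial, c, True
        if not improved:
            step /= 2.0
    return theta, best

# The "annotator" clicks only the two fingertips of a hidden pose.
true_theta = [0.7, -0.4]
labels = {j: p for j, p in project(true_theta).items() if j[1] == "tip"}
fitted, residual = fit(labels, [0.0, 0.0])
```

After fitting, `project(fitted)` also returns the unlabeled `"mid"` keypoints, which mirrors how the fitted model supplies joints the annotator never clicked.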

Strengths: Our annotation process has several 
strengths. First, kinematic constraints rule out some 
combinations of keypoints, so it is often possible 
to fit the model by labeling only a subset of 
keypoints. Second, the fitted model provides annotations 
for occluded keypoints. Third and most importantly, 
the fitted model provides 3D (x,y,z) keypoint locations 
given only 2D (u,v) annotations. 

Disagreements: As shown in Fig. 14, annotators 
disagree substantially on the hand pose in a 
surprising number of cases. In applications such as sign 
language [44], ambiguous poses are typically avoided. 
We believe it is important to acknowledge that, in 
general, it may not be possible to achieve full precision. 
Fig. 8 illustrates two examples of these annotator 
disagreements. 
[Figure 8 panels: rows show Depth, RGB, and Fitted model; columns show Annotator #1 and Annotator #2 for Frame A and for Frame B.] 


Fig. 8 Annotator disagreements: With whom do you agree? We show two frames where annotators disagreed. The top 
two rows show the RGB and depth images presented and the keypoint annotations received from the annotator. The bottom 
row shows the libhand model fitted to those keypoint annotations. 

In Frame A, the confusion revolves around the thumb position. Is the thumb occluded, folded down behind the other digits, 
or does it stand upright? The resolution, in both color and depth, makes this hard to decide. Long range (low resolution) 
scenarios are important, but in these scenarios we cannot expect performance comparable to that found in near range. 

Similarly, in Frame B one finger is occluded, but which one? Annotator 1 believes the thumb is occluded. Annotator 2 
believes the pinky is occluded. The fitted libhand models show that either interpretation is plausible. In this author’s opinion, 
annotator 1 is more consistent with the RGB evidence while annotator 2 is more consistent with the Depth evidence. 


[Figure 7 panels: columns show RGB, Depth, and LibHand; rows show steps (A), (B), and (C).] 

Fig. 7 Annotation procedure: We annotate until we are 
satisfied that the fitted hand pose matches the RGB and 
Depth data. The first two columns show the image evidence 
presented and keypoints received. The rightmost column 
shows the fitted libhand model. (A) The IK solver is able to 
easily fit a model to the five given keypoints, but it doesn’t 
match the image well. (B) The annotator attempts to 
correct the model, to better match the image, by labeling the 
wrist. (C) Labeling additional finger joints finally yields an 
acceptable solution. 


[Table 5 grid: columns are training sets (ICL, NYU, Ego, libhand); rows are test sets (ICL, NYU, EGO, Ours); each cell holds two percentages (max-error above, average-error below) and a median-error overlay.] 

Table 5 Cross-dataset generalization: We compare 
training and test sets using a 1-NN classifier. Diagonal 
entries represent the performance using corresponding train and 
test sets. In each grid entry, we denote the percentage of test 
frames that are correct (50mm max-error, above, and 50mm 
average-error, below) and visualize the median error using the 
colored overlays from Fig. [?]. We account for sensor-specific 
noise artifacts using established techniques [3]. Please refer 
to the text for more details. 
[Figure 9 plots: ICL Test Set [45]; x-axes: max joint error threshold (mm) and mean joint error threshold (mm), 0 to 100; legend: NN-Ego, NN-NYU, NN-ICL, NN-libhand, Hough [51], RDF [20], Simulation [23], DeepPrior [26], LRF [45].] 

Fig. 9 We plot results for several systems on the ICL 
testset using max-error (top) and average-error (bottom). Except 
for 1-NN, all systems are trained on the corresponding train 
set (in this case ICL-Train). To examine cross-dataset 
generalization, we also plot the performance of our NN-baseline 
constructed using alternate sets (NYU, EGO, and libhand). 
When trained with ICL, NN performs as well as or better than 
prior art. One can find near-perfect pose matches in the 
training set (see Fig. 10). Please see text for further discussion. 


6 Results 

We now report our experimental results, comparing 
datasets and methods. We first address the “state of 
the problem”: what aspects of the problem have been 
solved, and what remain open research questions? We 
conclude by discussing the specific lessons we learned 
and suggesting directions for future systems. 

Mostly-solved (distinct poses): Fig. 9 shows that hand 
pose estimation is mostly solved on datasets of 
uncluttered scenes where hands face the camera (i.e., ICL). 



Fig. 10 Min vs max error: Compared to state-of-the- 
art, our 1-NN baseline often does relatively better under the 
average-error criterion than under the max-error criterion. 
When it can find (nearly) an exact match between training 
and test data (left) it obtains very low error. However, it does 
not generalize well to unseen poses (right). When presented 
with a new pose it will often place some fingers perfectly but 
others totally wrong. The result is a reasonable mean error 
but a high max error. 


Deep models, decision forests, and NN all perform quite 
well, both in terms of articulated pose estimation (85% 
of frames are within 50mm max-error) and hand 
detection (100% are within 100mm max-error). 
Surprisingly, NN slightly outperforms decision forests. 
However, when NN is trained on other datasets with larger 
pose variation, performance is considerably worse. This 
suggests that the test poses closely resemble the 
training poses. This may nonetheless be reasonable for 
applications targeting sufficiently distinct poses from a finite 
vocabulary (e.g., a gaming interface). These results 
suggest that the state-of-the-art accurately predicts distinct 
poses (i.e., 50 mm apart) in uncluttered scenes. 


Major progress (unconstrained poses): The NYU 
testset still considers isolated hands, but includes a wider 
range of poses, viewpoints, and subjects compared to 
ICL (see Fig. [?]). Fig. 12 reveals that deep models 
perform the best for both articulated pose estimation (96% 
accuracy) and hand detection (100% accuracy). While 
decision forests struggle with the added variation in 
pose and viewpoint, NN still does quite well. In fact, 
when measured with average (rather than max) error, 
NN nearly matches the performance of [?]. This 
suggests that exemplars get most, but not all, fingers 
correct (see Fig. 10). Overall, we see noticeable progress on 
unconstrained pose estimation since 2007 [10]. 
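
A small worked example shows why an exemplar that gets most fingers right can look good under average error and poor under max error. The joint coordinates below are invented for illustration, and the 50mm threshold is the one highlighted earlier in the paper.

```python
import math
from statistics import fmean

# Hypothetical fingertip positions in mm: four fingers are off by 5mm,
# one finger is off by 80mm.
ground_truth = [(0, 0, 0), (20, 0, 0), (40, 0, 0), (60, 0, 0), (80, 0, 0)]
prediction = [(0, 5, 0), (20, 5, 0), (40, 5, 0), (60, 5, 0), (80, 0, 80)]

distances = [math.dist(g, p) for g, p in zip(ground_truth, prediction)]
mean_error = fmean(distances)  # (4 * 5 + 80) / 5 = 20mm
max_error = max(distances)     # 80mm

THRESHOLD_MM = 50
correct_by_mean = mean_error <= THRESHOLD_MM  # counted as correct
correct_by_max = max_error <= THRESHOLD_MM    # counted as incorrect
```

The same frame is scored as a success under the mean criterion and as a failure under the max criterion, which is the disconnect between the two sets of curves.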


Unsolved (low-res, objects, occlusions, clutter): When 
considering datasets (Figs. 14 and 15) with distant 
(low-res) hands and background clutter due to objects or 
interacting surfaces (Fig. 11), results are significantly 
worse. Note that many applications [40] demand that 
hands lie at distances greater than 750mm. For such 
scenes, hand detection is still a challenge. Scanning 
window approaches (such as our NN baseline) tend to 
outperform multistage pipelines [?], which may make 
Fig. 11 Complex backgrounds: Most existing systems, 
including our own 1-NN baseline, fail when challenged with 
complex backgrounds which cannot be trivially segmented. 
These backgrounds significantly alter the features extracted 
and processed and thus prevent even the best models from 
producing sensible output. 

[Figure 13 panels: (a) Latent Hough detection; (b) Hough orientation failure; (c) per-pixel classification; (d) hard segmentation.] 

[Figure 12 plots: NYU Test Dataset [47]; x-axes: max joint error threshold (mm) and mean joint error threshold (mm); legend: NN-Ego, NN-NYU, NN-ICL, NN-libhand, Hough [51], RDF [20], DeepJoint [47], DeepPrior [26].] 

Fig. 12 Deep models [47][26] perform noticeably better than 
other systems, and appear to solve both articulated pose 
estimation and hand detection for uncluttered single-user scenes 
(common in the NYU testset). However, the other systems 
compare more favorably under average error. In Fig. 10 we 
interpret this disconnect by using 1-NN to show that each 
test hand commonly matches a training example in all but 
one finger. Please see text for further discussion. 


Fig. 13 Many approach the problem of hand pose 
estimation in three phases: (1) detect and segment, (2) estimate pose, 
(3) validate or refine [51][20][47][45][18]. However, when an 
earlier stage fails, the later stages are often unable to recover. 
When detection and segmentation are non-trivial, this 
becomes the root cause of many failures. For example, Hough 
forests [51] (a) first estimate the hand’s location and 
orientation. They then convert to a cardinal translation and 
rotation before estimating joint locations. (b) When this first 
stage fails, the second stage cannot recover. (c) Other 
methods assume that segmentation is solved [20][12]. (d) When 
background clutter is inadvertently included by the hand 
segmenter, the finger pose estimator is prone to spurious outputs. 

an unrecoverable error in the first (detection and 
segmentation) stage. We show some illustrative examples 
in Fig. 13. However, overall performance is still lacking, 
particularly when compared to human performance. 
Interestingly, though, human (annotator) accuracy also 
degrades for low-resolution hands far away from the 
camera (Fig. 14). Our results suggest that scenes of 
in-the-wild hand activity are still beyond the reach of the 
state-of-the-art. 

Training data: We use our NN-baseline to analyze the 
effect of training data in Table 5. Our NN model 
performed better using the NYU training set [47] 
(consisting of real data automatically labeled with a 
geometrically-fit 3D CAD model) than with the libhand 
training set. While performance increases by enlarging 
the synthetic training set (Fig. 16), this quickly 
becomes intractable. This reflects the difficulty in using 
synthetic data: one must carefully model priors [26], 
sensor noise [?], and hand shape variations between 
users [46]. Moreover, in some cases, the variation in the 
performance of NN (dependent on the particular 
training set) exceeded the variation between model 
architectures (decision forests versus deep models); see Fig. [?]. Our 
results suggest that the diversity and realism of the training 
set are as important as the model form learned from 
it. 
[Figure 14 plots: “Our Test Dataset - All Hands” and “Our Test Dataset - Near Hands (< 750mm)”; x-axes: max joint error threshold (mm) and mean joint error threshold (mm); legend: NN-Ego, NN-NYU, NN-ICL, NN-libhand, Hough [51], Cascades [35], Human, EPM [53], DeepPrior [26], NiTE2 [32], PSO [28], DeepSeg [12], RDF [20], PXC [18].] 


Fig. 14 We designed our dataset to address the remaining challenges of “in-the-wild” hand pose estimation, including scenes 
with low-res hands, clutter, object/surface interactions, and occlusions. We plot human-level performance (as measured through 
inter-annotator agreement) in black. On nearby hands (within 750mm, as commonly assumed in prior work) our annotation 
quality is similar to existing testsets such as ICL [26]. This is impressive given that our testset includes comparatively more 
ambiguous poses (see Sec. 5.2). Our dataset includes far away hands, which even humans struggle to label accurately. 
Moreover, several methods (Cascades, PXC, NiTE2, PSO) fail to correctly localize any hand at any distance, though the 
mean-error plots are more forgiving than the max-error above. In general, NN-exemplars and DeepPrior perform the best, correctly 
estimating pose on 75% of frames with nearby hands. 


NN vs Deep models: Overall, our 1-NN baseline proved 
to be surprisingly strong, outperforming or matching the 
performance of most prior systems. This holds true even 
for moderately-sized training sets with tens of 
thousands of examples, suggesting that much prior work 
essentially memorizes training examples. One 
contribution of our analysis is the notion that NN-exemplars 
provide a vital baseline for understanding the behavior 
of a proposed system in relation to its training set. In 
fact, DeepJoint [47] and DeepPrior [26] were the sole 
approaches to significantly outperform 1-NN (Figs. [?] 
and 12). This indicates that deep architectures 
generalize well to novel test poses. This may contrast with 
existing folk wisdom about deep models: that the need 
for large training sets suggests that these models 
essentially memorize. Our results indicate otherwise. 
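
The idea behind the exemplar baseline can be sketched in a few lines. This is only a schematic: the paper's baseline uses volumetric exemplars in a scanning-window search, whereas the flat feature tuples, the Euclidean metric, and the names `nearest_exemplar` and `predict_pose` here are assumptions for illustration.

```python
import math

def nearest_exemplar(query, exemplars):
    """Return the (features, pose) pair whose features are closest to the query."""
    if not exemplars:
        raise ValueError("training set is empty")
    return min(exemplars, key=lambda ex: math.dist(query, ex[0]))

def predict_pose(query, exemplars):
    """1-NN prediction: copy the pose stored with the nearest training exemplar."""
    return nearest_exemplar(query, exemplars)[1]

# Hypothetical training set: (feature vector, joint positions in mm).
training_set = [
    ((0.0, 0.0), [(0, 0, 500), (10, 0, 500)]),   # e.g. a closed hand
    ((1.0, 1.0), [(0, 0, 500), (40, 30, 480)]),  # e.g. an open hand
]

pose = predict_pose((0.9, 0.8), training_set)
```

Because the prediction is copied verbatim from a training example, any accuracy such a baseline achieves on a test set is direct evidence of overlap between test and training poses.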

Conclusion: The past several years have shown 
tremendous progress regarding hand pose: training sets, 
testing sets, and models. Some applications, such as 
gaming interfaces and sign-language recognition, appear to 
be well within reach for current systems. Less than a 
decade ago, this was not true [?]. Thus, we 
have made progress! But, challenges remain nonethe¬ 
less. Specifically, when segmentation is hard due to ac- 
[Figure 15 plot: UCI-EGO Test Dataset [35]; x-axis: mean joint error threshold (mm); legend: NN-Ego, NN-NYU, NN-ICL, NN-libhand, Hough [51], RDF [20], PXC [18], R. Cascades [35], DeepPrior [26].] 

Fig. 15 For UCI-EGO, randomized cascades and our NN 
baseline perform about equally well, but overall, performance is 
considerably worse than on other datasets. No method is able to 
correctly estimate the pose (within 50mm) on any frame. 
Egocentric scenes contain more background clutter and 
object/surface interfaces, making even hand detection 
challenging for many methods. 



[Figure 16 plot: x-axis: Number of Training Samples.] 


Fig. 16 Synthetic data vs. accuracy: Synthetic training 
set size impacts performance on our test set. Performance 
grows logarithmically with the dataset size. Synthesis is 
theoretically unlimited, but in practice becomes unattractively 
slow. 


tive hands or clutter, many existing methods fail. To 
illustrate these realistic challenges we introduce a novel 
testset. We demonstrate that realism and diversity in 
training sets are crucial, and can be as important as the 
choice of model architecture. In terms of model 
architecture, we perform a broad benchmark evaluation and 
find that deep models appear particularly well-suited 
for pose estimation. Finally, we demonstrate that NN 
using volumetric exemplars provides a startlingly 
potent baseline, providing an additional tool for analyzing 
both methods and datasets. 